Xorbits is a powerful tool for exploring and analyzing large datasets. One of the classic datasets for demonstrating its capabilities is the NYC taxi dataset, which contains records of taxi rides in the city from 2009 to 2022. In this blog post, we will explore how to use Xorbits for an initial exploration of the NYC taxi dataset and get a sense of the insights we might be able to gain from the data.
The first step is to initialize Xorbits:
import xorbits
# Initialize Xorbits in the local environment.
xorbits.init()
The second step is to load the data into an Xorbits DataFrame. This can be done using the
read_parquet() function, which allows us to specify the location of the parquet file and
any additional options we want to use while reading the data.
In the case of the NYC taxi dataset, here is an example of how we could do this using Xorbits:
import datetime
import json
import xorbits.pandas as pd
trips = pd.read_parquet("path/to/yellow_tripdata_2022-*.parquet")
# Remove rows whose pick-up timestamp falls outside 2022.
trips = trips[
    (trips['tpep_pickup_datetime'] >= datetime.datetime(2022, 1, 1)) &
    (trips['tpep_pickup_datetime'] <= datetime.datetime(2022, 12, 31))
]
taxi_zones = pd.read_csv('path/to/taxi+_zone_lookup.csv')
with open('path/to/taxi_zones.geojson') as fd:
    geojson = json.load(fd)
Once we have the data loaded into a DataFrame, we might want to get a sense of the overall structure of the data by looking at the number of rows and columns, the data types of each column, and the first few rows of the data. We can do this using the shape and dtypes attributes and the head() method, respectively:
print(trips.shape)
print(trips.dtypes)
print(trips.head())
(nan, 19)
VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
airport_fee                     float64
dtype: object
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2022-10-01 00:03:41   2022-10-01 00:18:39              1.0
1         2  2022-10-01 00:14:30   2022-10-01 00:19:48              2.0
2         2  2022-10-01 00:27:13   2022-10-01 00:37:41              1.0
3         1  2022-10-01 00:32:53   2022-10-01 00:38:55              0.0
4         1  2022-10-01 00:44:55   2022-10-01 00:50:21              0.0

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0           1.70         1.0                  N           249           107
1           0.72         1.0                  N           151           238
2           1.74         1.0                  N           238           166
3           1.30         1.0                  N           142           239
4           1.00         1.0                  N           238           166

   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \
0             1          9.5    3.0      0.5        2.65           0.0
1             2          5.5    0.5      0.5        0.00           0.0
2             1          9.0    0.5      0.5        2.06           0.0
3             1          6.5    3.0      0.5        2.05           0.0
4             1          6.0    0.5      0.5        1.80           0.0

   improvement_surcharge  total_amount  congestion_surcharge  airport_fee
0                    0.3         15.95                   2.5          0.0
1                    0.3          9.30                   2.5          0.0
2                    0.3         12.36                   0.0          0.0
3                    0.3         12.35                   2.5          0.0
4                    0.3          9.10                   0.0          0.0
We can also use the describe() method to get a summary of the statistical properties of
each numerical column in the dataset.
print(trips.describe())
VendorID passenger_count trip_distance RatecodeID \
count 3.300328e+07 3.188340e+07 3.300328e+07 3.188340e+07
mean 1.716089e+00 1.400064e+00 6.051177e+00 1.426007e+00
std 4.862856e-01 9.669177e-01 6.135931e+02 5.824383e+00
min 1.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
25% 1.000000e+00 NaN 1.100000e+00 NaN
50% 2.000000e+00 NaN 1.900000e+00 NaN
75% 2.000000e+00 NaN 3.550000e+00 NaN
max 6.000000e+00 9.000000e+00 3.896785e+05 9.900000e+01
PULocationID DOLocationID payment_type fare_amount extra \
count 3.300328e+07 3.300328e+07 3.300328e+07 3.300328e+07 3.300328e+07
mean 1.646682e+02 1.624247e+02 1.188556e+00 1.459878e+01 1.008178e+00
std 6.537942e+01 7.027724e+01 5.121790e-01 1.047507e+02 1.246117e+00
min 1.000000e+00 1.000000e+00 0.000000e+00 -2.564000e+03 -2.218000e+01
25% 1.320000e+02 1.130000e+02 1.000000e+00 7.000000e+00 0.000000e+00
50% 1.620000e+02 1.620000e+02 1.000000e+00 1.000000e+01 5.000000e-01
75% 2.340000e+02 2.340000e+02 1.000000e+00 1.600000e+01 2.500000e+00
max 2.650000e+02 2.650000e+02 5.000000e+00 4.010923e+05 3.350000e+01
mta_tax tip_amount tolls_amount improvement_surcharge \
count 3.300328e+07 3.300328e+07 3.300328e+07 3.300328e+07
mean 4.891569e-01 2.707103e+00 5.272037e-01 2.961781e-01
std 9.192201e-02 3.247909e+00 2.023945e+00 4.730772e-02
min -5.500000e-01 -4.100000e+02 -9.999000e+01 -3.000000e-01
25% 5.000000e-01 9.300000e-01 0.000000e+00 3.000000e-01
50% 5.000000e-01 2.150000e+00 0.000000e+00 3.000000e-01
75% 5.000000e-01 3.350000e+00 0.000000e+00 3.000000e-01
max 2.548000e+01 1.400160e+03 9.118700e+02 1.000000e+00
total_amount congestion_surcharge airport_fee
count 3.300328e+07 3.188340e+07 3.188340e+07
mean 2.133517e+01 2.282358e+00 9.578618e-02
std 1.052695e+02 7.487687e-01 3.357228e-01
min -2.567800e+03 -2.500000e+00 -1.250000e+00
25% 1.230000e+01 NaN NaN
50% 1.595000e+01 NaN NaN
75% 2.274000e+01 NaN NaN
max 4.010956e+05 2.750000e+00 1.250000e+00
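The describe() output reveals some implausible values, such as negative fares and trip distances of hundreds of thousands of miles, so further cleaning may be worthwhile. Here is a minimal sketch of that kind of filtering, using plain pandas on a small hypothetical sample (xorbits.pandas mirrors the pandas API, so the same expressions should work on the full trips DataFrame; the thresholds are illustrative assumptions, not official rules):

```python
import pandas as pd  # xorbits.pandas mirrors the pandas API shown here

# Hypothetical sample with a negative fare and an implausible distance,
# similar to the extremes visible in the describe() output.
sample = pd.DataFrame({
    "fare_amount": [9.5, 5.5, -2564.0, 6.5],
    "trip_distance": [1.7, 0.72, 1.74, 389678.5],
})

# Keep rides with a non-negative fare and a distance under 100 miles.
cleaned = sample[(sample["fare_amount"] >= 0) & (sample["trip_distance"] < 100)]
print(len(cleaned))  # 2
```

The choice of cutoffs depends on the analysis; the point is that boolean masks compose the same way as the date filter shown earlier.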
One way to analyze the NYC taxi dataset is to look at how the number of rides varies over time.
We can do this by creating a new column in the DataFrame that represents the pick-up date of
each ride, and then use the groupby method to group the data by month or year and compute the
count of rides for each group:
trips['PU_date'] = trips['tpep_pickup_datetime'].dt.date
count = trips.groupby('PU_date', as_index=False).agg(count=('VendorID', 'count'))
print(count)
        PU_date   count
0    2022-01-01   63441
1    2022-01-02   58421
2    2022-01-03   72405
3    2022-01-04   74562
4    2022-01-05   74592
..          ...     ...
300  2022-10-28  132215
301  2022-10-29  127966
302  2022-10-30  108386
303  2022-10-31  101467
304  2022-11-01       9

[305 rows x 2 columns]
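The code above groups by day; grouping by month, as mentioned earlier, works the same way with a coarser key. A minimal sketch using plain pandas on a few hypothetical timestamps (xorbits.pandas mirrors the pandas API, though whether it supports `.dt.to_period` is an assumption worth verifying):

```python
import pandas as pd  # xorbits.pandas mirrors the pandas API shown here

# Hypothetical pick-up timestamps spanning two months.
df = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-01-05", "2022-01-20", "2022-02-03"]
    ),
    "VendorID": [1, 2, 1],
})

# Derive a calendar-month key instead of a daily one.
df["PU_month"] = df["tpep_pickup_datetime"].dt.to_period("M").astype(str)
monthly = df.groupby("PU_month", as_index=False).agg(count=("VendorID", "count"))
print(monthly)
```

Grouping by year would simply swap in `.dt.year` as the key.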
We can then use a library like plotly to visualize the time series data:
import plotly.express as px
b = px.bar(count.to_pandas(), x='PU_date', y='count')
b.show()
Another way to analyze the NYC taxi dataset is to look at patterns in the spatial distribution of rides. Taking Manhattan as an example, we first filter the DataFrame by pick-up location ID:
manhattan_zones = taxi_zones[taxi_zones['Borough'] == 'Manhattan']['LocationID']
manhattan_trips = trips[trips['PULocationID'].isin(manhattan_zones)]
print(manhattan_trips)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \
0 1 2022-10-01 00:03:41 2022-10-01 00:18:39 1.0
1 2 2022-10-01 00:14:30 2022-10-01 00:19:48 2.0
2 2 2022-10-01 00:27:13 2022-10-01 00:37:41 1.0
3 1 2022-10-01 00:32:53 2022-10-01 00:38:55 0.0
4 1 2022-10-01 00:44:55 2022-10-01 00:50:21 0.0
... ... ... ... ...
3588288 2 2022-05-31 23:07:04 2022-05-31 23:19:19 NaN
3588289 2 2022-05-31 23:15:58 2022-05-31 23:24:08 NaN
3588291 2 2022-05-31 23:52:12 2022-06-01 00:06:40 NaN
3588292 2 2022-05-31 23:27:00 2022-05-31 23:48:00 NaN
3588294 2 2022-05-31 23:00:53 2022-05-31 23:07:54 NaN
trip_distance RatecodeID store_and_fwd_flag PULocationID \
0 1.70 1.0 N 249
1 0.72 1.0 N 151
2 1.74 1.0 N 238
3 1.30 1.0 N 142
4 1.00 1.0 N 238
... ... ... ... ...
3588288 2.59 NaN None 164
3588289 1.86 NaN None 114
3588291 2.91 NaN None 144
3588292 3.96 NaN None 161
3588294 1.43 NaN None 238
DOLocationID payment_type fare_amount extra mta_tax tip_amount \
0 107 1 9.50 3.0 0.5 2.65
1 238 2 5.50 0.5 0.5 0.00
2 166 1 9.00 0.5 0.5 2.06
3 239 1 6.50 3.0 0.5 2.05
4 166 1 6.00 0.5 0.5 1.80
... ... ... ... ... ... ...
3588288 231 0 13.52 0.0 0.5 3.73
3588289 107 0 10.88 0.0 0.5 1.57
3588291 256 0 11.68 0.0 0.0 2.51
3588292 145 0 14.54 0.0 0.5 0.00
3588294 41 0 10.05 0.0 0.5 2.21
tolls_amount improvement_surcharge total_amount \
0 0.0 0.3 15.95
1 0.0 0.3 9.30
2 0.0 0.3 12.36
3 0.0 0.3 12.35
4 0.0 0.3 9.10
... ... ... ...
3588288 0.0 0.3 20.55
3588289 0.0 0.3 15.75
3588291 0.0 0.3 16.99
3588292 0.0 0.3 17.84
3588294 0.0 0.3 15.56
congestion_surcharge airport_fee PU_date
0 2.5 0.0 2022-10-01
1 2.5 0.0 2022-10-01
2 0.0 0.0 2022-10-01
3 2.5 0.0 2022-10-01
4 0.0 0.0 2022-10-01
... ... ... ...
3588288 NaN NaN 2022-05-31
3588289 NaN NaN 2022-05-31
3588291 NaN NaN 2022-05-31
3588292 NaN NaN 2022-05-31
3588294 NaN NaN 2022-05-31
[29403245 rows x 20 columns]
Then we use the groupby method to group the data by pick-up location ID and compute the count of
rides for each group:
gb_pu_location = manhattan_trips.groupby(['PULocationID'], as_index=False).agg(count=('VendorID', 'count')).to_pandas()
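Before plotting, it can be handy to inspect the busiest zones in tabular form by joining the counts back to the zone lookup table. A minimal sketch with plain pandas and hypothetical zone names and counts (the real `taxi_zones` table loaded earlier has the same `LocationID` and `Zone` columns):

```python
import pandas as pd  # xorbits.pandas mirrors the pandas API shown here

# Hypothetical zone lookup and per-zone ride counts.
taxi_zones = pd.DataFrame({
    "LocationID": [161, 237],
    "Zone": ["Midtown Center", "Upper East Side South"],
})
gb_pu_location = pd.DataFrame({"PULocationID": [161, 237], "count": [120, 340]})

# Attach zone names and sort zones by ride count, busiest first.
top = (
    gb_pu_location
    .merge(taxi_zones, left_on="PULocationID", right_on="LocationID")
    .sort_values("count", ascending=False)
)
print(top[["Zone", "count"]])
```

This gives a quick ranked list that complements the choropleth map below it in spirit, without requiring any plotting library.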
We can then use a library like plotly to visualize the spatial distribution of rides:
m = px.choropleth(
gb_pu_location,
geojson=geojson,
locations='PULocationID',
featureidkey='properties.location_id',
color='count',
color_continuous_scale="Viridis",
range_color=(0, gb_pu_location['count'].max()),
labels={'count':'trips count'}
)
m.update_geos(fitbounds="locations", visible=False)
m.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
m.show()